
Load packages

library(tidyverse)
library(DPpackage)
## library(MCMCpack)

Dirichlet Process and Dirichlet Process Mixture

Why Construct a Mixture

The previous Dirichlet process example used discrete data with a discrete centering measure \(G_{0}\), so it was no surprise that the posterior put mass on discrete measures \(G\). However, the Dirichlet process yields measures \(G\) that are discrete almost surely (with probability one) regardless of the centering measure \(G_{0}\). For density estimation, it is therefore more appealing to add one more layer to the model and use the DP draw as a mixing distribution.
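To see the almost-sure discreteness concretely, here is a minimal base-R sketch of the stick-breaking (Sethuraman) construction of a DP draw. The mass parameter M, the truncation level K, and the standard-normal centering measure are illustrative choices, not part of the example analyzed below.

```r
## Truncated stick-breaking sketch: a draw G = sum_k w_k * delta_{m_k}
## is discrete even though the centering measure G0 = N(0, 1) is continuous.
set.seed(1)
M <- 1      # mass (concentration) parameter, illustrative
K <- 1000   # truncation level for the infinite sum
v <- rbeta(K, 1, M)                  # stick-breaking proportions v_k ~ Beta(1, M)
w <- v * cumprod(c(1, 1 - v[-K]))    # weights w_k = v_k * prod_{j<k} (1 - v_j)
m <- rnorm(K)                        # atom locations drawn iid from G0
sum(w)                               # close to 1 for large K
head(sort(w, decreasing = TRUE))     # a few atoms carry most of the mass
```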

Hierarchical Structure

\[\begin{align*} y_{i} | \theta_{i} &\overset{\text{ind}}{\sim} f_{\theta_{i}}\\ \\ \theta_{i} | G &\overset{\text{iid}}{\sim} G\\ \\ G &\sim DP(M G_{0}) \end{align*} \]

This is interpreted as follows.

  1. Each data point \(y_{i}\) has its own latent parameter value \(\theta_{i}\). Given this latent parameter value, \(y_{i} | \theta_{i}\) follows a parametric distribution with a density function \(f_{\theta_{i}}\).

  2. In turn, each individual-specific latent parameter value \(\theta_{i}\) given a random probability measure \(G\) is an iid sample from the distribution defined by \(G\).

  3. The distribution over the random measure \(G\) is governed by a Dirichlet process with a mass (concentration) parameter \(M\) and a centering measure \(G_{0}\), where \(G_{0}\) should have a support appropriate for \(\theta_{i}\).
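The three layers can be simulated forward without representing \(G\) explicitly, using the Pólya urn (Blackwell-MacQueen) scheme obtained by integrating \(G\) out. The sketch below uses \(f_{\theta} = N(\theta, \sigma^{2})\) and \(G_{0} = N(0, 4)\) as in the model fitted later; the values of n, M, and sigma are illustrative.

```r
## Forward simulation from the DPM hierarchy via the Polya urn scheme.
## theta_i is new (drawn from G0) with prob M/(M+i-1), otherwise it
## copies an earlier theta_j -- the ties are what form clusters.
set.seed(2)
n <- 200; M <- 1; sigma <- 0.5          # illustrative values
theta <- numeric(n)
theta[1] <- rnorm(1, 0, 2)              # first draw from G0 = N(0, 4), i.e. sd = 2
for (i in 2:n) {
  if (runif(1) < M / (M + i - 1)) {
    theta[i] <- rnorm(1, 0, 2)          # new atom from G0
  } else {
    theta[i] <- theta[sample.int(i - 1, 1)]  # reuse an earlier theta uniformly
  }
}
y <- rnorm(n, mean = theta, sd = sigma) # observed data given the latent theta_i
length(unique(theta))                   # number of distinct clusters
```

Note the use of `sample.int(i - 1, 1)` to pick an index: calling `sample(theta[1:(i - 1)], 1)` would misbehave when only one earlier value exists, because `sample()` treats a length-one numeric argument as a range.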

Posterior Inference

Given the structure above and the vector notation \(\boldsymbol{\theta} = (\theta_{1},\dots,\theta_{n})\) and \(\mathbf{y} = (y_{1}, \dots, y_{n})\), the posterior distribution of the random measure \(G\) given \(\boldsymbol{\theta}\) is the following.

\[\begin{align*} G | \boldsymbol{\theta} &\sim DP \left( MG_{0} + \sum^{n}_{i=1} \delta_{\theta_{i}} \right) \end{align*} \]

Now consider the posterior distribution of \(\boldsymbol{\theta}\) given \(\mathbf{y}\), \(p(\boldsymbol{\theta} | \mathbf{y})\). Integrating \(\boldsymbol{\theta}\) out of the expression above with respect to this posterior distribution yields the following.

\[\begin{align*} G | \mathbf{y} &\sim \int DP \left( MG_{0} + \sum^{n}_{i=1} \delta_{\theta_{i}} \right) \text{d} p(\boldsymbol{\theta} | \mathbf{y}) \end{align*} \]
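One useful consequence of the conjugacy in the first display is the posterior expectation of \(G\), a standard DP fact stated here for reference: it shrinks the empirical distribution of the \(\theta_{i}\) toward the centering measure, with weights governed by \(M\) and \(n\).

\[\begin{align*} E[G \,|\, \boldsymbol{\theta}] &= \frac{M}{M+n} G_{0} + \frac{n}{M+n} \cdot \frac{1}{n} \sum^{n}_{i=1} \delta_{\theta_{i}} \end{align*} \]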

Load data

This is an example from the BNPDA book (p13).

## https://web.ma.utexas.edu/users/pmueller/bnp/R/DP/eig121/EIG.txt
data1 <- read.table(header = TRUE, text = "
  id recorded-EIG121 imputed-EIG121
  1      NA 0.00941
  2 0.00533 0.00533
  3 0.02310 0.02310
  4      NA 0.67871
  5 0.00008 0.00008
  6 0.27840 0.27840
  7 0.06530 0.06530
  8      NA 0.03108
  9      NA 0.03108
 10 0.00060 0.00060
 11 0.02750 0.02750
 12      NA 0.03108
 13 0.03060 0.03060
 14      NA 0.03108
 15      NA 0.67871
 16 0.03300 0.03300
 17      NA 0.03108
 18      NA 0.03108
 19      NA 0.67871
 20 0.09650 0.09650
 21 0.02560 0.02560
 22 0.01670 0.01670
 23 0.06980 0.06980
 24 0.00730 0.00730
 25      NA 0.03108
 26      NA 0.77071
 27      NA 0.03108
 28 0.00500 0.00500
 29      NA 0.03108
 31 0.00891 0.00891
 32      NA 0.22861
 33      NA 0.22861
 34 0.02630 0.02630
 35      NA 0.22861
 36      NA 0.03108
 37      NA 0.03108
 38      NA 0.03108
 39      NA 0.00941
 40 0.01357 0.01357
 41 0.01759 0.01759
 42 0.83840 0.83840
 43 1.18167 1.18167
 44 0.01540 0.01540
 45 0.67239 0.67239
 46 0.09752 0.09752
 47 0.16455 0.16455
 48 0.55606 0.55606
 49 0.64961 0.64961
 50 0.01287 0.01287
 51 0.00192 0.00192
 52 0.00043 0.00043
 53 0.01897 0.01897
 54 0.00036 0.00036
 55 0.23272 0.23272
 56 0.43327 0.43327
 57 1.10649 1.10649
 58 0.85073 0.85073
 59 0.00318 0.00318
 60 0.03394 0.03394
 61 0.13869 0.13869
 62 0.00345 0.00345
 63 0.22849 0.22849
 64 0.28294 0.28294
 65 0.75363 0.75363
 66 0.18504 0.18504
 67 0.01930 0.01930
 68 1.05398 1.05398
 69 0.02071 0.02071
 70 0.00295 0.00295
 71 0.00394 0.00394
 72 0.32358 0.32358
 73 0.01006 0.01006
 ")

Specific Model

As outlined in this program, we will use the following model.

\[\begin{align*} y_{i} | \theta_{i} &\overset{\text{ind}}{\sim} N(\theta_{i}, \sigma^{2})\\ \\ \theta_{i} | G &\overset{\text{iid}}{\sim} G\\ \\ G &\sim DP(M \, N(0,4))\\ \\ \frac{1}{\sigma} &\sim Gamma(1, 1)\\ \end{align*} \]

The model implemented in DPdensity is the following. See the DPpackage manual for details.

\[\begin{align*} y_{i} | (\mu_{i}, \Sigma_{i}) &\overset{\text{ind}}{\sim} N(\mu_{i},\Sigma_{i})\\ \\ (\mu_{i}, \Sigma_{i}) | G &\overset{\text{iid}}{\sim} G\\ \\ G | (\alpha, G_{0}) &\sim DP(\alpha G_{0})\\ \\ G_{0} &= N \left( \mu \bigg| m_{1}, \frac{1}{k_{0}} \Sigma \right) IW \left( \Sigma | \nu_{1}, \psi_{1} \right)\\ \\ \alpha | (a_{0},b_{0}) &\sim Gamma(a_{0},b_{0})\\ m_{1} | (m_{2},s_{2}) &\sim N(m_{2},s_{2})\\ k_{0} | (\tau_{1},\tau_{2}) &\sim Gamma \left( \frac{\tau_{1}}{2}, \frac{\tau_{2}}{2} \right)\\ \psi_{1} | (\nu_{2}, \psi_{2}) &\sim IW(\nu_{2}, \psi_{2})\\ \end{align*} \]

Fitting

mcmc <- list(nburn = 1000,      # burn-in scans to discard
             nsave = 10000,     # posterior draws to save
             nskip = 10,        # thinning interval between saved draws
             ndisplay = 100)    # progress display interval
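As a rough sanity check on runtime, the settings above imply the following number of MCMC scans, assuming each saved draw is separated by nskip scans (the exact convention is described in the DPpackage manual).

```r
## Illustrative arithmetic only, under the thinning assumption above.
nburn <- 1000; nsave <- 10000; nskip <- 10
total_scans <- nburn + nsave * nskip
total_scans  # 101000
```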

## Example of Prior information 1
## Fixing alpha, m1, and Psi1
prior1 <- list(alpha=1,m1=rep(0,1),psiinv1=diag(0.5,1),
               nu1=4, tau1=1,tau2=100)
## Example of Prior information 2
## Fixing alpha and m1
prior2 <- list(alpha=1,m1=rep(0,1),
               psiinv2=solve(diag(0.5,1)),
               nu1=4,nu2=4,tau1=1,tau2=100)
## Example of Prior information 3
## Fixing only alpha
prior3 <- list(alpha=1,
               m2=rep(0,1),s2=diag(100000,1),
               psiinv2=solve(diag(0.5,1)),
               nu1=4,nu2=4,tau1=1,tau2=100)
## Example of Prior information 4
## Everything is random (alpha, m1, and Psi1)
prior4 <- list(a0=2,b0=1,m2=rep(0,1),s2=diag(100000,1),
               psiinv2=solve(diag(0.5,1)),
               nu1=4,nu2=4,tau1=1,tau2=100)

## Fit the models
state <- NULL
fit1.1 <- DPdensity(y=data1$imputed.EIG121,
                    prior=prior1,
                    mcmc=mcmc,
                    state=state,
                    status=TRUE)
fit1.2 <- DPdensity(y=data1$imputed.EIG121,
                    prior=prior2,
                    mcmc=mcmc,
                    state=state,
                    status=TRUE)
fit1.3 <- DPdensity(y=data1$imputed.EIG121,
                    prior=prior3,
                    mcmc=mcmc,
                    state=state,
                    status=TRUE)
fit1.4 <- DPdensity(y=data1$imputed.EIG121,
                    prior=prior4,
                    mcmc=mcmc,
                    state=state,
                    status=TRUE)

Check posterior means.

fit1.1
## 
## DPM of normals model for density estimation
## 
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior1, mcmc = mcmc, 
##     state = state, status = TRUE)
## 
## Posterior Predictive Distributions (log):
##      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.  
## -2.734675  -0.007213   0.816516   0.229982   0.876233   0.914218  
## 
## Posterior Inference of Parameters:
##       k0  ncluster  
##  0.03179   2.52330  
## 
## Number of Observations: 72
## Number of Variables: 1
fit1.2
## 
## DPM of normals model for density estimation
## 
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior2, mcmc = mcmc, 
##     state = state, status = TRUE)
## 
## Posterior Predictive Distributions (log):
##    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  
## -3.6079  -0.8741   2.1831   0.9550   2.3783   2.7068  
## 
## Posterior Inference of Parameters:
##         k0  psi1-data1    ncluster  
##    0.02394   427.65153     5.55130  
## 
## Number of Observations: 72
## Number of Variables: 1
fit1.3
## 
## DPM of normals model for density estimation
## 
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior3, mcmc = mcmc, 
##     state = state, status = TRUE)
## 
## Posterior Predictive Distributions (log):
##    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  
## -5.8361  -0.8339   2.1825   0.9013   2.3796   2.6754  
## 
## Posterior Inference of Parameters:
##   m1-data1          k0  psi1-data1    ncluster  
##    0.18102     0.01754   474.09594     6.64690  
## 
## Number of Observations: 72
## Number of Variables: 1
fit1.4
## 
## DPM of normals model for density estimation
## 
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior4, mcmc = mcmc, 
##     state = state, status = TRUE)
## 
## Posterior Predictive Distributions (log):
##      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.  
## -468.4270    -0.7128     2.2117   -21.4283     2.4529     2.5927  
## 
## Posterior Inference of Parameters:
##    m1-data1           k0   psi1-data1     ncluster        alpha  
##    0.312830     0.007534  3201.080112    12.233800     3.446769  
## 
## Number of Observations: 72
## Number of Variables: 1

Plot diagnostics.

plot(fit1.1,ask=FALSE,output="param")

plot(fit1.2,ask=FALSE,output="param")

plot(fit1.3,ask=FALSE,output="param")

plot(fit1.4,ask=FALSE,output="param")

Plot specific aspects.

## Extracting the posterior mean of the specific
## means and covariance matrices
## (only prior 2 for illustration)
DPrandom(fit1.2)
## 
## Random effect information for the DP object:
## 
## Call:
## DPdensity.default(y = data1$imputed.EIG121, prior = prior2, mcmc = mcmc, 
##     state = state, status = TRUE)
## 
## 
## Posterior mean of subject-specific components:
## 
##     mu-data1   var-data1
## 1   0.0193149  0.0003132
## 2   0.0196523  0.0003719
## 3   0.0195060  0.0002753
## 4   0.7278682  0.0338056
## 5   0.0196232  0.0003572
## 6   0.2668811  0.0086877
## 7   0.0822295  0.0028461
## 8   0.0202380  0.0003070
## 9   0.0204050  0.0003498
## 10  0.0192207  0.0003046
## 11  0.0200893  0.0003247
## 12  0.0200863  0.0003103
## 13  0.0204879  0.0003364
## 14  0.0204447  0.0003411
## 15  0.7277419  0.0338248
## 16  0.0206074  0.0003650
## 17  0.0206461  0.0003712
## 18  0.0202675  0.0003533
## 19  0.7280343  0.0338045
## 20  0.1292736  0.0044645
## 21  0.0198303  0.0003246
## 22  0.0190999  0.0002684
## 23  0.0952444  0.0035652
## 24  0.0191947  0.0003125
## 25  0.0205857  0.0003622
## 26  0.7371182  0.0347540
## 27  0.0202673  0.0003398
## 28  0.0193754  0.0003261
## 29  0.0205024  0.0003356
## 30  0.0191687  0.0002961
## 31  0.2342404  0.0050618
## 32  0.2338385  0.0049705
## 33  0.0196127  0.0002740
## 34  0.2342252  0.0050059
## 35  0.0205750  0.0003435
## 36  0.0204956  0.0003459
## 37  0.0204189  0.0003263
## 38  0.0191022  0.0002895
## 39  0.0192178  0.0002959
## 40  0.0194586  0.0002945
## 41  0.7462193  0.0353005
## 42  0.8664941  0.0346518
## 43  0.0195951  0.0003283
## 44  0.7279628  0.0339209
## 45  0.1309314  0.0045681
## 46  0.2110084  0.0068066
## 47  0.7143412  0.0345703
## 48  0.7267252  0.0338672
## 49  0.0193588  0.0003047
## 50  0.0192636  0.0003215
## 51  0.0193367  0.0003230
## 52  0.0194252  0.0002907
## 53  0.0194776  0.0003239
## 54  0.2351395  0.0050733
## 55  0.6209941  0.0329175
## 56  0.8653310  0.0346267
## 57  0.7493435  0.0353119
## 58  0.0191919  0.0002961
## 59  0.0206568  0.0003171
## 60  0.1838068  0.0065993
## 61  0.0194561  0.0003292
## 62  0.2346873  0.0051875
## 63  0.2702528  0.0091336
## 64  0.7352495  0.0347099
## 65  0.2217447  0.0059956
## 66  0.0194304  0.0002755
## 67  0.8605347  0.0347827
## 68  0.0195886  0.0003098
## 69  0.0192587  0.0003104
## 70  0.0192932  0.0002940
## 71  0.3484460  0.0167453
## 72  0.0190817  0.0002805
## Plotting predictive information about the subject-specific
## means and covariance matrices
## with HPD and credibility intervals
## (only prior 2 for illustration)
## (to see the plots one by one, set ask=TRUE)
plot(DPrandom(fit1.2,predictive=TRUE),ask=FALSE)

plot(DPrandom(fit1.2,predictive=TRUE),ask=FALSE,hpd=FALSE)

## Plotting information about all the subject-specific means
## and covariance matrices
## with HPD and credibility intervals
## (only prior 2 for illustration)
## (to see the plots one by one, set ask=TRUE)
plot(DPrandom(fit1.2),ask=FALSE,hpd=FALSE)

print(sessionInfo())
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DPpackage_1.1-7.4 forcats_0.3.0     stringr_1.3.1     dplyr_0.7.7       purrr_0.2.5       readr_1.1.1      
##  [7] tidyr_0.8.2       tibble_1.4.2      ggplot2_3.1.0     tidyverse_1.2.1   doRNG_1.7.1       rngtools_1.3.1   
## [13] pkgmaker_0.27     registry_0.5      doParallel_1.0.14 iterators_1.0.10  foreach_1.4.4     knitr_1.20       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 splines_3.5.1    haven_1.1.2      lattice_0.20-35  colorspace_1.3-2 htmltools_0.3.6 
##  [7] yaml_2.2.0       survival_2.43-1  rlang_0.3.0.1    pillar_1.3.0     glue_1.3.0       withr_2.1.2     
## [13] readxl_1.1.0     modelr_0.1.2     bindrcpp_0.2.2   bindr_0.1.1      plyr_1.8.4       cellranger_1.1.0
## [19] munsell_0.5.0    gtable_0.2.0     rvest_0.3.2      codetools_0.2-15 evaluate_0.12    broom_0.5.0     
## [25] Rcpp_0.12.19     xtable_1.8-3     backports_1.1.2  scales_1.0.0     jsonlite_1.5     hms_0.4.2       
## [31] digest_0.6.18    stringi_1.2.4    rprojroot_1.3-2  grid_3.5.1       bibtex_0.4.2     cli_1.0.1       
## [37] tools_3.5.1      magrittr_1.5     lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2  Matrix_1.2-14   
## [43] MASS_7.3-51      xml2_1.2.0       lubridate_1.7.4  rstudioapi_0.8   assertthat_0.2.0 rmarkdown_1.10  
## [49] httr_1.3.1       R6_2.3.0         nlme_3.1-137     compiler_3.5.1
## Record execution time and multicore use
## (start_time is recorded at the top of the full script)
end_time <- Sys.time()
diff_time <- difftime(end_time, start_time, units = "auto")
cat("Started  ", as.character(start_time), "\n",
    "Finished ", as.character(end_time), "\n",
    "Time difference of ", diff_time, " ", attr(diff_time, "units"), "\n",
    "Used ", foreach::getDoParWorkers(), " cores\n",
    "Used ", foreach::getDoParName(), " as backend\n",
    sep = "")
## Started  2018-11-23 06:01:40
## Finished 2018-11-23 06:02:58
## Time difference of 1.290052 mins
## Used 12 cores
## Used doParallelMC as backend